1,410 research outputs found

Predicting Anatomical Therapeutic Chemical (ATC) Classification of Drugs by Integrating Chemical-Chemical Interactions and Similarities

Author: DN Georgiou
GA Watson
GP Zhou
H Gurulingappa
H Mohabatkar
IW Althaus
J Andraos
J Lin
Kai-Yan Feng
KC Chou
Kuo-Chen Chou
L Hu
Lei Chen
M Dunkel
M Esmaeili
M Hattori
M Kanehisa
M Kuhn
Ozlem Keskin
P Jaccard
P Wang
Q Gu
R Sharan
T Huang
U Karaoz
Wei-Ming Zeng
WZ Lin
X Xiao
YD Cai
Yu-Dong Cai
ZC Wu
Publication venue: Public Library of Science
Publication date: 13/04/2012

The Anatomical Therapeutic Chemical (ATC) classification system, recommended by the World Health Organization, categorizes drugs into different classes according to their therapeutic and chemical characteristics. For a set of query compounds, how can we identify which ATC class (or classes) they belong to? This is an important and challenging problem because the information thus obtained would be quite useful for drug development and utilization. By hybridizing the information from chemical-chemical interactions and chemical-chemical similarities, a novel method was developed for this purpose. The jackknife test on a benchmark dataset of 3,883 drug compounds showed that the overall success rate achieved by the prediction method was about 73% in identifying the drugs among the following 14 main ATC classes: (1) alimentary tract and metabolism; (2) blood and blood forming organs; (3) cardiovascular system; (4) dermatologicals; (5) genitourinary system and sex hormones; (6) systemic hormonal preparations, excluding sex hormones and insulins; (7) anti-infectives for systemic use; (8) antineoplastic and immunomodulating agents; (9) musculoskeletal system; (10) nervous system; (11) antiparasitic products, insecticides and repellents; (12) respiratory system; (13) sensory organs; (14) various. Such a success rate is substantially higher than the roughly 7% (1 in 14) expected from random guessing. It has not escaped our notice that the current method can be straightforwardly extended to identify the drugs for their 2nd-level, 3rd-level, 4th-level, and 5th-level ATC classifications once statistically significant benchmark data are available for these lower levels.
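The jackknife (leave-one-out) test and the random-guess baseline mentioned in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy fingerprints, the labels, the single-label nearest-neighbour rule, and the function names are hypothetical, and this is not the authors' predictor, which also uses chemical-chemical interactions and allows a drug to belong to several classes.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two sets of structural features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jackknife_accuracy(features, labels):
    """Leave-one-out test: each sample is predicted from all the others
    by copying the label of its most similar neighbour."""
    correct = 0
    for i, query in enumerate(features):
        others = (j for j in range(len(features)) if j != i)
        nearest = max(others, key=lambda j: jaccard(query, features[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)


# Hypothetical toy data: each set stands in for a compound fingerprint.
toy_features = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
toy_labels = ["A", "A", "B", "B"]

accuracy = jackknife_accuracy(toy_features, toy_labels)

# With 14 equally likely classes, random guessing succeeds 1 time in 14.
random_baseline = 1 / 14
```

On real data the fingerprints would come from a cheminformatics toolkit, and a multi-label variant would rank all 14 classes rather than return a single one.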

Public Library of Science (PLOS)

Crossref

PubMed Central

FigShare

Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

Author: A Hoglund
B Liao
CE Rasmussen
DN Georgiou
FM Li
Franca Fraternali
G Tsoumakas
GP Zhou
H Mohabatkar
H Nakashima
HB Shen
HN Lin
Hong Gu
J Ma
J Tian
J Yin
Jianjun He
JY Shi
K Imai
KC Chou
KY Lee
L Chen
L Hu
LJ Foster
LL Hu
M Esmaeili
MS Scott
O Emanuelsson
P Wang
RE Schapire
S Briesemeister
S Hua
S Mei
S Zhang
T Huang
T Liu
Wenqi Liu
WZ Lin
X Jiang
X Xiao
YH Zeng
YL Chen
Z He
Z Lu
ZC Wu
Publication venue: Public Library of Science
Publication date: 08/06/2012

It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them have focused on proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of proteins with multiple sites. However, almost all the existing approaches fail to take into account the correlations among the locations caused by proteins with multiple sites, which may be important information for improving the prediction accuracy for such proteins. In this paper, a new algorithm that can effectively exploit the correlations among the locations is proposed, using a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction technologies and is robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Analysis and Prediction of the Metabolic Stability of Proteins Based on Their Sequential Features, Subcellular Locations and Interaction Networks

Author: A Madkan
A Ruepp
Andreas Hofmann
B Niu
C Chen
C Chothia
CA Minetti
DS Wishart
FM Li
G Pollastri
H Ding
H Lin
H Peng
H Wei
HB Shen
HC Yen
I Dubchak
J Wang
JF Wang
JJ Chou
JL Fauchere
JR Schnell
K Gong
K Oxenoid
Kai-Yan Feng
KC Chou
Kuo-Chen Chou
L Cristian
L Li
LeLe Hu
LJ Jensen
MM Gromiha
P Martel
P Rice
PA Fields
Ping Wang
QS Du
R Grantham
R Lumry
R Sharan
RB Huang
RM Pielak
SF Altschul
SH White
T Huang
Tao Huang
TJ Kamerzell
TL Zhang
X Xiao
Xiangyin Kong
Xiao-He Shi
Yi-Xue Li
Yu-Dong Cai
Z Qian
Zhisong He
Publication venue: Public Library of Science
Publication date: 04/06/2010

Metabolic stability is a very important characteristic of proteins that is related to their global flexibility, intramolecular fluctuations, various internal dynamic processes, and many biological functions. Determining a protein's metabolic stability would provide useful information for in-depth understanding of the dynamic action mechanisms of proteins. Although several experimental methods have been developed to measure a protein's metabolic stability, they are time-consuming and expensive. Reported in this paper is a computational method that features (1) integrating various properties of proteins, such as biochemical and physicochemical properties, subcellular locations, network properties and protein complex property, (2) using the mRMR (Maximum Relevance & Minimum Redundancy) principle and the IFS (Incremental Feature Selection) procedure to optimize the prediction engine, and (3) being able to identify proteins among four types: “short”, “medium”, “long”, and “extra-long” half-life spans. Our analysis revealed that the following seven features played major roles in determining the stability of proteins: (1) KEGG enrichment scores of the protein and its neighbors in the network, (2) subcellular locations, (3) polarity, (4) amino acid composition, (5) hydrophobicity, (6) secondary structure propensity, and (7) the number of protein complexes in which the protein is involved. An intriguing correlation was observed between the predicted metabolic stability of some proteins and the real half-life of the drugs designed to target them. These findings might provide useful insights for designing protein-stability-relevant drugs. The computational method can also be used as a large-scale tool for annotating the metabolic stability of the avalanche of protein sequences generated in the post-genomic age.
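The IFS (Incremental Feature Selection) procedure named in this abstract can be sketched as follows. This is an illustrative sketch only: it assumes the feature ranking has already been produced (for example by mRMR, which is not implemented here), it swaps in a leave-one-out 1-nearest-neighbour classifier as a stand-in prediction engine, and the toy data and function names are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the feature columns listed in feature_idx."""
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in feature_idx)

    correct = 0
    for i in range(len(rows)):
        others = (j for j in range(len(rows)) if j != i)
        nearest = min(others, key=lambda j: dist(rows[i], rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def incremental_feature_selection(rows, labels, ranked_features):
    """Grow the feature set one ranked feature at a time and keep the
    smallest prefix of the ranking that gives the best accuracy."""
    best_subset, best_acc = [], -1.0
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        acc = loo_1nn_accuracy(rows, labels, subset)
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


# Hypothetical toy data: column 0 is informative, column 1 is noise.
toy_rows = [(0.0, 5.0), (0.1, 1.0), (1.0, 5.1), (1.1, 0.9)]
toy_labels = ["short", "short", "long", "long"]
ranking = [0, 1]  # assumed to come from an mRMR-style ranking

selected, selected_acc = incremental_feature_selection(toy_rows, toy_labels, ranking)
```

In this toy case adding the noisy second column lowers the accuracy, so the procedure stops at the first feature; on real data the accuracy curve over the ranked prefixes is usually inspected to choose the cut-off.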

Public Library of Science (PLOS)

Crossref

PubMed Central

Novel Inhibitor Design for Hemagglutinin against H1N1 Influenza Virus by Core Hopping Method

Author: Anil Kumar Tyagi
AW Schuttelkopf
C Oostenbrink
CA Del Carpio
CW Ward
E De Clercq
ED Akten
H Wei
HL Yen
HM Berman
J Stevens
JF Wang
JJ Irwin
JL Banks
JL McKimm-Breschkin
JR Schnell
KC Chou
KL Hartshorn
Kuo-Chen Chou
L Cai
M Hendlich
M Uchida
MD de Jong
MD Eldridge
MF Boni
N Kolocouris
N Naffakh
NJ McDonald
NM Varki
PJ Goodford
PK Cheng
QH Liao
QS Du
R Schauer
RA Fouchier
RA Friesner
RM Pielak
Run-Ling Wang
S Sirois
Shu-Qing Wang
T Wang
TA Halgren
TT Chang
W Zhang
Wei-Ren Xu
Xiao-Bo Li
Publication venue: Public Library of Science
Publication date: 01/01/2011

The worldwide spread of H1N1 avian influenza and the increasing reports of its resistance to current drugs have made the development of new anti-influenza drugs a high priority. Owing to its unique function in helping viruses bind to the cellular surface, a key step before they penetrate the infected cell, hemagglutinin (HA) has become one of the main targets for drug design against influenza virus. To develop potent HA inhibitors, the ZINC fragment database was searched for the optimal compound with the core hopping technique. As a result, the Neo6 compound was obtained. Subsequent molecular docking studies and molecular dynamics simulations showed that Neo6 not only assumes a more favorable conformation at the binding pocket of HA but also has a stronger binding interaction with its receptor. Accordingly, Neo6 may become a promising candidate for developing new and more powerful drugs for treating influenza. At the very least, the findings reported here may provide useful insights to stimulate new strategies in this area.

CiteSeerX

Public Library of Science (PLOS)

Crossref

PubMed Central

Design Novel Dual Agonists for Treating Type-2 Diabetes by Targeting Peroxisome Proliferator-Activated Receptors with Core Hopping Approach

Author: A Rubenstrunk
BG Shearer
CG Ji
CS Gandhi
ED Rosen
F Blaschke
FJ Prado-Prado
G Liu
GA Kaminski
GG Kochendoerfer
HB Rubins
HB Shen
HE Xu
HM Berman
I Issemann
J Zhang
JF Wang
JJ Irwin
JN Feige
K Mochizuki
KC Chou
KC Javangula
KD Singh
KJ Bowers
Kuo-Chen Chou
L Cai
L Michalik
L Yue
M Gangloff
MA Dea-Ayuela
MD Eldridge
MR Housaindokht
P Balakumar
P Cronet
P Markt
P Wang
PA Carpino
Peter Csermely
QH Liao
QS Du
R Concu
RT Nolte
Run-Ling Wang
S Ebdrup
S Sirois
Shu-Qing Wang
SN Lewis
T Husslein
TA Halgren
Wei-Ren Xu
WG Hoover
WL Jorgensen
X Hou
X Xiao
XB Li
XY Liu
Ying Ma
Z Song
Publication venue: Public Library of Science
Publication date: 07/06/2012

Owing to their unique functions in regulating glucose, lipid and cholesterol metabolism, PPARs (peroxisome proliferator-activated receptors) have drawn special attention for developing drugs to treat type-2 diabetes. By combining the lipid benefit of PPAR-alpha agonists (such as fibrates) with the glycemic advantages of PPAR-gamma agonists (such as thiazolidinediones), the dual PPAR agonist approach can both improve the metabolic effects and minimize the side effects caused by either agent alone, and hence has become a promising strategy for designing effective drugs against type-2 diabetes. In this study, by means of the “core hopping” and “glide docking” techniques, a novel class of PPAR dual agonists was discovered based on the compound GW409544, a well-known dual agonist for both PPAR-alpha and PPAR-gamma modified from the farglitazar structure. Molecular dynamics simulations showed that these novel agonists not only possessed the same function as GW409544 in activating PPAR-alpha and PPAR-gamma, but also had a more favorable conformation for binding to the two receptors. Their ADME (absorption, distribution, metabolism, and excretion) predictions further indicated that the new agonists hold high potential to become drug candidates. At the very least, the findings reported here may stimulate new strategies or provide useful insights for discovering more effective dual agonists for treating type-2 diabetes. Since the “core hopping” technique allows for rapidly screening novel cores to help overcome unwanted properties by generating new lead compounds with improved core properties, it has not escaped our notice that the current strategy, along with the corresponding computational procedures, can also be used to find novel and more effective drugs for treating other illnesses.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Multi-Label Multi-Kernel Transfer Learning for Human Protein Subcellular Localization

Author: A Dijk
A Garg
A Hoglund
A Pierleoni
B Boeckmann
D Barrell
Francisco José Esteban
HB Shen
J Platt
K Lee
KC Chou
L Rajendran
L Zhu
M Mak
Q Yang
S Altschul
S Mei
S Pan
Suyu Mei
T Blum
T Tung
T Wu
W Dai
W Huang
X Xiao
Y Tu
Publication venue: Public Library of Science
Publication date: 13/06/2012

Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, the existing sequence-based predictive models demonstrate moderate or unsatisfactory performance, and the gene ontology (GO) based models risk overestimating performance for novel proteins. Furthermore, many human proteins have multiple subcellular locations, which makes the computational modelling more complicated. To date, very few studies have specialized in predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM proposes a multi-label confusion matrix, formally formulates three multi-labelling performance measures, and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario. On this basis it further extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) to multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, a comprehensive survey of model performance for novel proteins, and multi-labelling capability, MLMK-TLM will gain more practical applicability. Experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

NR-2L: A Two-Level Predictor for Identifying Nuclear Receptor Subfamilies Based on Sequence-Derived Features

Author: DJ Mangelsdorf
GP Zhou
H Florence
H Mohabatkar
H Nakashima
JM Keller
KC Chou
KK Kandaswamy
Kuo-Chen Chou
L Altucci
M Bhasin
M Masso
M Robinson-Rechavi
Niall James Haslam
PC Mahalanobis
Pu Wang
QB Gao
RR Joshi
SF Altschul
T Cover
T Liu
T Wang
VD Gusev
W Li
W Liu
X Xiao
Xuan Xiao
Publication venue: Public Library of Science

Nuclear receptors (NRs) are one of the most abundant classes of transcriptional regulators in animals. They regulate diverse functions, such as homeostasis, reproduction, development and metabolism. Therefore, NRs are a very important target for drug development. Nuclear receptors form a superfamily of phylogenetically related proteins and have been subdivided into different subfamilies due to their domain diversity. In this study, a two-level predictor, called NR-2L, was developed that can be used to identify whether a query protein is a nuclear receptor based on its sequence information alone; if it is, the prediction automatically continues to identify it among the following seven subfamilies: (1) thyroid hormone like (NR1), (2) HNF4-like (NR2), (3) estrogen like (NR3), (4) nerve growth factor IB-like (NR4), (5) fushi tarazu-F1 like (NR5), (6) germ cell nuclear factor like (NR6), and (7) knirps like (NR0). The identification was made by the fuzzy K-nearest neighbor (FK-NN) classifier based on the pseudo amino acid composition formed by incorporating various physicochemical and statistical features derived from the protein sequences, such as amino acid composition, dipeptide composition, complexity factor, and low-frequency Fourier spectrum components. As a demonstration, it was shown on benchmark datasets with low redundancy derived from NucleaRDB and UniProt that the overall success rates achieved by the jackknife test were about 93% and 89% at the first and second level, respectively. The high success rates indicate that the novel two-level predictor can be a useful vehicle for identifying NRs and their subfamilies. As a user-friendly web server, NR-2L is freely accessible at either http://icpr.jci.edu.cn/bioinfo/NR2L or http://www.jci-bioinfo.cn/NR2L. Each job submitted to NR-2L can contain up to 500 query protein sequences and be finished in less than 2 minutes. The fewer the query proteins, the shorter the time will usually be. All the program code for NR-2L is available for non-commercial purposes upon request.
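Two of the sequence-derived features listed in this abstract, amino acid composition and dipeptide composition, are simple frequency vectors and can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the toy sequence are hypothetical, residues outside the 20 standard amino acids are simply ignored in the counts, and the complexity factor, the Fourier components and the FK-NN classifier used by NR-2L are not reproduced.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among all
    overlapping dipeptides in the sequence."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = len(seq) - 1  # number of overlapping dipeptides
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: c / total for pair, c in counts.items()}


# Hypothetical toy sequence, far shorter than any real protein.
aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

Concatenating such vectors gives a fixed-length representation (here 20 + 400 values) regardless of sequence length, which is what allows a distance-based classifier such as a nearest-neighbour method to compare proteins of different sizes.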

Crossref

Directory of Open Access Journals

PubMed Central

Prediction of Protein Domain with mRMR Feature Selection and Analysis

Author: AA Schaffer
AG Murzin
AK Dunker
AM Moses
AP Elhammer
B Saffari
Bi-Qing Li
Bin Xue
BQ Li
CA Orengo
D Chivian
D Li
DE Kim
E Angov
EC Mbamala
G Pugalenthi
GP Zhou
H Ingolfsson
H Mohabatkar
H Peng
HB Shen
I Walsh
ID Campbell
IH Witten
J Chen
J Cheng
J Eickholt
J Lin
J Liu
J Wang
JD Qiu
JE Gewehr
JJ Chou
JR Schnell
K Peng
K Shameer
K Wang
Kai-Yan Feng
KC Chou
KK Kandaswamy
Kuo-Chen Chou
L Breiman
L Chen
L Holm
Le-Le Hu
Lei Chen
M Esmaeili
M Hayat
M Suyama
MJ Berardi
MK Yoon
N Nagarajan
N von Ohsen
NM Goldenberg
P Mundra
P Tompa
P Wang
PE Wright
PK Nielsen
Q Gu
R Apweiler
R Bondugula
R Guerois
R Linding
RA George
RA Poorman
S Gong
S Kawashima
S Roy
SC Jia
SF Altschul
SM Reynolds
T Ebina
T Huang
TA Holland
W Li
W Zhao
WR Atchley
WZ Lin
X Xiao
Y Zhang
YD Cai
YD Li
Yu-Dong Cai
YX Li
Z He
Z Qiu
ZC Wu
Publication venue: Public Library of Science
Publication date: 01/01/2012

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than those of the existing methods on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Gene ontology based transfer learning for protein subcellular localization

Author: A Bateman
A Dijk
A Hoglund
A Pierleoni
C Chen
C Leslie
DH Haft
E Marcotte
EM Zdobnov
F Corpet
FM Li
G Lanckriet
G Schneider
H Ding
H Lin
H Liu
H Rangwala
H Shen
HB Shen
J Cedano
J Schultz
J Shen
JD Qiu
K Chou
K Hofmann
K Lee
KC Chou
L Nanni
M Ashburner
M Esmaeili
M Mak
M Wang
Q Gu
Q Yang
R Apweiler
R Kuang
S Mei
S Pan
Shuigeng Zhou
Suyu Mei
T Blum
T Tung
TK Attwood
W Dai
W Huang
Wang Fei
X Jiang
X Xiao
XB Zhou
YH Zeng
YS Ding
Z Lei
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources to exploit multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of the potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space complexities are greatly reduced compared to the complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvement over the baseline models: 80.38% against 67.40% for Euk-mPLoc, a 12.98% increase; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, with 7.05% and 6.67% accuracy increases on dataset MultiLoc plant and dataset MultiLoc animal, respectively; 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, with accuracy increases of 13.44%, 5.8% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc plant holdout and dataset BaCelLoc animal holdout, respectively, compared against 76%, 60.00% and 73.00% for the baseline model MultiLoc-GO, with accuracy increases of 5.25%, 20.45% and 6.46%, respectively. Conclusions: Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
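The step in which several kernels are weighed and then linearly merged into one kernel can be sketched as follows. The abstract does not state the exact weighting rule, so the rule used here (weights proportional to each kernel's cross-validation accuracy, normalised to sum to 1) is an assumption for illustration, as are the function names and the toy matrices; it is not presented as the GO-TLM implementation.

```python
def kernel_weights(cv_accuracies):
    """Assumed rule: weight each kernel by its cross-validation accuracy,
    normalised so that the weights sum to 1."""
    total = sum(cv_accuracies)
    if total <= 0:
        raise ValueError("at least one accuracy must be positive")
    return [acc / total for acc in cv_accuracies]


def combine_kernels(kernels, weights):
    """Linear combination K = sum_k w_k * K_k of equally sized square
    kernel matrices given as nested lists."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy example with two 2x2 kernels instead of five.
identity_kernel = [[1.0, 0.0], [0.0, 1.0]]
constant_kernel = [[1.0, 1.0], [1.0, 1.0]]

weights = kernel_weights([0.8, 0.2])  # made-up per-kernel CV accuracies
combined = combine_kernels([identity_kernel, constant_kernel], weights)
```

Because a non-negative weighted sum of positive semi-definite matrices is itself positive semi-definite, the merged matrix can be passed to any kernel method, for example a support vector machine that accepts a precomputed kernel.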

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools

Author: A Bulashevska
A Krogh
C Andreoli
C Guda
CS Yu
E Badidi
E Frank
GE Tusnady
Gertraud Burger
H Bannai
H Shatkay
HB Shen
I Small
JL Heazlewood
JR Quinlan
JY Shi
KC Chou
KJ Park
L Kall
M Bhasin
M Boden
MG Claros
MS Scott
N Pfanner
N Wiedemann
O Emanuelsson
P Donnes
QB Gao
S Džeroski
S Hua
S Matsuda
SHB Chou KC
T Hirokawa
T Zhang
W Li
X Xiao
Y Huang
Yao Qing Shen
YD Cai
YL Chen
YX Pan
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Knowing the subcellular location of proteins provides clues to their function as well as the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for a given protein. Since the individual tools each have particular strengths, we set out to integrate them in a way that optimally exploits their potential. The method we present here is applicable to various subcellular locations, but tailored for predicting whether or not a protein is localized in mitochondria. Knowledge of the mitochondrial proteome is relevant to understanding the role of this organelle in global cellular processes. Results: In order to develop a method for enhanced prediction of subcellular localization, we integrated the outputs of available localization prediction tools by several strategies, and tested the performance of each strategy with known mitochondrial proteins. The accuracy obtained (up to 92%) far surpasses that of the individual tools. The method of integration proved crucial to the performance. For the prediction of mitochondrion-located proteins, integration via a two-layer decision tree clearly outperforms simpler methods, as it allows emphasis of biologically relevant features such as the mitochondrial targeting peptide and transmembrane domains. Conclusion: We developed an approach that enhances the prediction accuracy of mitochondrial proteins by uniting the strengths of specialized tools. The combination of machine-learning based integration with biological expert knowledge leads to improved performance. This approach also alleviates the conundrum of how to choose between conflicting predictions. Our approach is easy to implement, and applicable to predicting subcellular locations other than mitochondria, as well as other biological features. For a trial of our approach, we provide a web service for mitochondrial protein prediction (named YimLOC), which can be accessed through the AnaBench suite at http://anabench.bcm.umontreal.ca/anabench/. The source code is provided in Additional File 2, which contains scripts for the online server YimLOC. Note that these scripts only cover the ready-to-use STACK-mem-DT described in the main text; they do not provide the training process.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central